Social media now serves many purposes at once. People use platforms such as YouTube for education, music, and news, which makes YouTube a major force in shaping digital culture and trends. Understanding what drives a YouTuber’s success therefore matters both to individual creators and to the broader media landscape. By analyzing data on top YouTubers across countries, this project aims to give content creators and advertisers insights for improving engagement and profitability. The primary dataset, obtained from Kaggle, contains information on the 995 top-ranked YouTube channels worldwide. We use interactive visualizations to examine how these channels are distributed across countries and categories and how the success-related variables relate to one another. We also fit predictive models that quantify YouTubers’ success in terms of subscribers and earnings.
We imported the data and applied the function
janitor::clean_names to convert all variable names to
lower case and replace spaces with underscores. Since YouTube was founded
in 2005, we removed 6 accounts with a creation year before 2005, which are
likely data entry errors. The ‘country’ and ‘category’ variables are
encoded as factors to facilitate categorical analysis. We also add two
new variables, ‘video_per_upload’ and ‘earning_differences’. To preserve
data integrity and avoid a significant reduction in dataset size, NA
values are kept at this stage and handled separately in each
visualization and model fit. The cleaned dataset
comprises 588 observations across 20 variables, which form the basis of
our exploratory data analysis and model fitting:
- id
- subscribers: Number of subscribers to the channel
- video_views: Total views across all videos on the channel
- category: Category or niche of the channel
- uploads: Total number of videos uploaded on the channel
- country: Country where the YouTube channel originates
- channel_type: Type of the YouTube channel (e.g., individual, brand)
- video_views_rank: Ranking of the channel based on total video views
- country_rank: Ranking of the channel based on the number of subscribers within its country
- channel_type_rank: Ranking of the channel based on its type (individual or brand)
- video_views_for_the_last_30_days: Total video views in the last 30 days
- lowest_monthly_earnings: Lowest estimated monthly earnings from the channel
- highest_monthly_earnings: Highest estimated monthly earnings from the channel
- lowest_yearly_earnings: Lowest estimated yearly earnings from the channel
- highest_yearly_earnings: Highest estimated yearly earnings from the channel
- subscribers_for_last_30_days: Number of new subscribers gained in the last 30 days
- created_year: Year when the YouTube channel was created
- population: Total population of the country
- latitude: Latitude coordinate of the country’s location
- longitude: Longitude coordinate of the country’s location
- video_per_upload: Average number of video views per uploaded video
- earning_differences: Range of yearly earnings for each channel, calculated by subtracting lowest_yearly_earnings from highest_yearly_earnings.

# Make sure 'country' and 'category' are factors.
cleaned_df$country <- as.factor(cleaned_df$country)
cleaned_df$category <- as.factor(cleaned_df$category)
# Add new variables 'video_per_upload' and 'earning_differences'
cleaned_df$video_per_upload <- with(cleaned_df, video_views / uploads)
cleaned_df$earning_differences <- with(cleaned_df, highest_yearly_earnings - lowest_yearly_earnings)
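The cleaning steps described above can be sketched end to end on a toy data frame. This is a minimal illustration rather than the project's actual pipeline: `raw_df` and all of its values are invented, keeping rows with a missing `created_year` is an assumption, and the guard against `uploads == 0` is an addition that the original code does not include.

```r
# Toy stand-in for the raw table; all values are invented for illustration.
raw_df <- data.frame(
  created_year            = c(1970, 2006, 2012, NA),
  country                 = c("United States", "India", "Brazil", "India"),
  category                = c("Music", "Music", "Gaming", "nan"),
  video_views             = c(1e9, 5e9, 2e9, 3e9),
  uploads                 = c(10, 0, 400, 100),
  lowest_yearly_earnings  = c(1e5, 2e5, NA, 4e5),
  highest_yearly_earnings = c(9e5, 3e6, 5e6, 6e6),
  stringsAsFactors        = FALSE
)

# Drop channels dated before YouTube existed (2005). Rows with a missing
# year are kept here, which is an assumption about the intended behaviour.
keep   <- is.na(raw_df$created_year) | raw_df$created_year >= 2005
toy_df <- raw_df[keep, ]

toy_df$country  <- as.factor(toy_df$country)
toy_df$category <- as.factor(toy_df$category)

# Views per upload. A channel with zero uploads would give Inf, so it is
# set to NA instead (an added guard, not part of the original code).
toy_df$video_per_upload <- ifelse(
  toy_df$uploads > 0,
  toy_df$video_views / toy_df$uploads,
  NA_real_
)

# NA earnings propagate to the difference instead of dropping the row.
toy_df$earning_differences <-
  toy_df$highest_yearly_earnings - toy_df$lowest_yearly_earnings

print(toy_df[, c("created_year", "video_per_upload", "earning_differences")])
```

Whether zero-upload channels should become NA or be removed is a design choice; the point of the sketch is only that the division needs some explicit handling.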
In order to preserve data integrity and avoid a significant reduction
in dataset size, we have opted to retain NA values within
the dataset. Each instance of NA will be addressed
individually in subsequent stages of our analysis, ensuring that they
are appropriately managed during both the visualization and model
fitting processes.
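# Count channels per country location, skipping rows without coordinates
channel_counts_by_location <- cleaned_df |>
  drop_na(c(latitude, longitude)) |>
  group_by(country, longitude, latitude) |>
  summarise(channel_count = n(), .groups = "drop")
# Draw one marker per country; label and popup both show the channel count
world_map <- leaflet() |>
  addTiles() |>
  addMarkers(
    data = channel_counts_by_location,
    ~longitude, ~latitude,
    label = ~paste0(country, ": ", channel_count, " channels"),
    popup = ~paste0(country, ": ", channel_count, " channels")
  )
world_map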
The five countries with the most channels in the dataset are the United States (N = 311), India (N = 168), Brazil (N = 61), the United Kingdom (N = 43), and Mexico (N = 33).
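Country counts like these can be obtained by tabulating the country column and sorting. The snippet below is a small sketch on invented labels; in the report the input would be `cleaned_df$country`.

```r
# Invented country labels for illustration.
country <- c("United States", "India", "United States", "Brazil",
             "India", "United States", "Mexico", NA)

# table() drops NA by default, so channels without a country are not counted.
counts <- sort(table(country), decreasing = TRUE)
top    <- head(counts, 3)
print(top)
```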
youtube_df <- cleaned_df
all_columns <- colnames(youtube_df)
columns_to_plot <- all_columns[!all_columns %in% c("id", "category","country","abbreviation","channel_type","population","latitude","longitude","created_year")]
numeric_data_long <-
youtube_df[, columns_to_plot] %>%
gather(key = "variable", value = "value")
# Create a single plot with facets for each numeric variable
p <- ggplot(numeric_data_long, aes(x = value)) +
geom_histogram(aes(y = after_stat(density)), bins = 15, fill = "#8dab7f", alpha = 0.8) +
geom_density(color="#6b8e23")+
facet_wrap(~ variable, scales = "free", ncol = 3) +
scale_x_continuous(labels = scales::comma) +
theme_minimal(base_size = 10) +
theme(
strip.text.x = element_text(size = 10, face = "bold"),
axis.text.x = element_text(angle = 20, hjust = 1, vjust = 1,size=7,face = "bold"), # Angle x-axis labels for readability
axis.title.x = element_text(size = 12),
axis.title.y = element_text(size = 12),
plot.title = element_text(size = 16, face = "bold"),
plot.margin = margin(1, 1, 1, 1, "cm"), # Adjust the plot margins
strip.background = element_blank(),
panel.spacing = unit(3, "lines")
) +
labs(
title = "Distribution of Numeric Variables",
x = "Value",
y = "Density",
caption = "Source: YouTube Data"
)
# Convert to an interactive plot
ggplotly(p)
We use plotly to create interactive plots of the density
distribution of each numeric variable. Because most of the variables are
strongly right-skewed, we apply a logarithmic transformation,
log(value + 1), and plot the distributions again.
p <- ggplot(numeric_data_long, aes(x = log(value+1))) +
geom_histogram(aes(y = after_stat(density)), bins = 15, fill = "#8dab7f", alpha = 0.8) +
geom_density(color="#6b8e23")+
facet_wrap(~ variable, scales = "free", ncol = 3) +
scale_x_continuous(labels = scales::comma) +
theme_minimal(base_size = 10) +
theme(
strip.text.x = element_text(size = 10, face = "bold"),
axis.text.x = element_text(angle = 20, hjust = 1, vjust = 1,size=7,face = "bold"), # Angle x-axis labels for readability
axis.title.x = element_text(size = 12),
axis.title.y = element_text(size = 12),
plot.title = element_text(size = 16, face = "bold"),
plot.margin = margin(1, 1, 1, 1, "cm"), # Adjust the plot margins
strip.background = element_blank(),
panel.spacing = unit(3, "lines")
) +
labs(
title = "Distribution of Log-Transformed Numeric Variables",
x = "log(Value + 1)",
y = "Density",
caption = "Source: YouTube Data"
)
# Convert to an interactive plot
ggplotly(p)
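The effect of the log transformation on skewness can also be checked numerically. The sketch below uses simulated log-normal values as a stand-in for a variable such as `video_views`; the `skewness` helper is defined here for illustration and is not a function from any package used in the report.

```r
set.seed(1)
# Simulated right-skewed values standing in for a variable such as video_views.
x <- rlnorm(1000, meanlog = 10, sdlog = 1.5)

# Moment-based sample skewness (a simple helper defined for this sketch).
skewness <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

raw_skew <- skewness(x)
log_skew <- skewness(log1p(x))  # log1p(x) is log(x + 1), as in the plot code

cat("skewness before:", round(raw_skew, 2), " after:", round(log_skew, 2), "\n")
```

A skewness near zero after the transformation indicates a roughly symmetric distribution, which is what the second set of histograms is meant to show.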
# Calculate the correlation matrix
cor_matrix <- cor(youtube_df[, columns_to_plot], use = "complete.obs")
fig <- plot_ly(x = colnames(cor_matrix), y = rownames(cor_matrix), z = cor_matrix,
type = "heatmap",colorscale ="Greens" , zmin = -1, zmax = 1)
fig <- fig %>% layout(
yaxis = list(autorange = "reversed"),
width=800,
height=600,
title = "Correlation Matrix")
fig
The heat map depicts the Pearson correlations between the continuous
variables. Subscribers and Video Views are strongly correlated
(\(r\) = 0.85). Their correlations with most other
variables are moderate to weak (\(r\) around 0.46), and they are almost uncorrelated with
the Uploads variable (\(r\) = 0.08 and 0.15). Notably, the
Lowest Earnings and Highest Earnings variables, both monthly and yearly,
are almost perfectly correlated with one another
(\(r\) close to 1).
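A near-perfect correlation between monthly and yearly earnings is what one would see if the yearly estimates were close to a fixed multiple of the monthly ones. The sketch below illustrates this on simulated data; the factor of 12 is an assumption made for the example, not something the dataset documents. It also shows that `use = "complete.obs"` drops every row with a missing value in any column.

```r
set.seed(42)
n <- 200
# Simulated earnings; the factor of 12 is an assumption about how yearly
# estimates relate to monthly ones.
lowest_monthly <- rlnorm(n, meanlog = 10, sdlog = 1)
lowest_yearly  <- 12 * lowest_monthly
uploads        <- rpois(n, lambda = 500)
lowest_monthly[c(3, 17)] <- NA  # a few missing values, as in the real data

toy <- data.frame(lowest_monthly, lowest_yearly, uploads)

# use = "complete.obs" removes rows with an NA in any column before computing r.
cor_matrix <- cor(toy, use = "complete.obs")
print(round(cor_matrix, 2))
```

Variables that are this strongly related carry almost the same information, which is one reason to avoid using several of them as predictors in the same regression.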
year_created_plot <- plot_ly(cleaned_df, x = ~created_year, type = "histogram",
marker = list(color = "#B3CDD1", line = list(color = "white", width = 1)),
nbinsx = 30)
# Update layout
year_created_plot <- year_created_plot |>
layout(
title = "Distribution of Channel Creation Years",
xaxis = list(title = "Year of Creation"),
yaxis = list(title = "Number of Channels"),
showlegend = FALSE,
template = "plotly_white" # Optional: Set a template for the plot
)
year_created_plot
The histogram shows the year of channel creation on the x-axis and the number of channels created in that year on the y-axis. The number of channels grows from 2005 onwards, which is expected because YouTube was founded in February 2005 and gained popularity gradually. Channel creation peaks in 2014, with 66 channels, and stays relatively high from 2011 to 2016, which may correspond to YouTube’s rise in global accessibility and to the platform becoming a viable career option for content creators. In later years there is a noticeable decline in new channel creation among the top channels. This could be due to market saturation or to content creators diversifying onto emerging platforms, and it could also reflect that newer channels have had less time to reach the top ranks.
- Word Cloud
category_data <- youtube_df %>%
filter(!is.na(category) & category != "nan") %>%
count(category) %>%
mutate(n=n*30) %>%
ungroup()
category_data$scaled_size <- log(category_data$n + 1) # adding 1 to avoid log(0)
wordcloud_plot <- ggplot(category_data, aes(label = category, size = scaled_size)) +
geom_text_wordcloud(
aes(color = n),
shape = 'circle',
rm_outside = TRUE
) +
scale_size_area(max_size = 10) +
scale_color_gradient(low = "#ffcc99", high = "#8dab7f") +
theme_void(base_family = "sans") +
theme(legend.position = "none",
plot.margin = margin(1, 1, 1, 1, "cm")) # Adjust margins around the plot
# Display the plot
wordcloud_plot

We exclude channels whose category is missing ("nan") and size each
word by the logarithm of its scaled category count n. The most
frequent categories in the word cloud are Entertainment,
Music, People & Blogs, and
Gaming.
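The sizing used for the word cloud compresses differences between categories considerably. The sketch below, on invented category counts, compares the ratio between the largest and smallest category before and after the `n * 30` and `log(n + 1)` steps from the code above.

```r
# Invented category counts for illustration.
category_counts <- c(Entertainment = 240, Music = 200, Gaming = 90, Trailers = 2)

scaled_n    <- category_counts * 30  # the constant multiplier from the code above
scaled_size <- log(scaled_n + 1)     # adding 1 avoids log(0)

# Ratio between the largest and smallest category, before and after scaling.
raw_ratio  <- max(category_counts) / min(category_counts)
size_ratio <- max(scaled_size) / min(scaled_size)
cat("raw ratio:", raw_ratio, " size ratio:", round(size_ratio, 2), "\n")
```

Under this scaling the word sizes convey the ordering of the categories more than their actual proportions, so the word cloud is best read as a ranking.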
# Count channels per channel type
channel_type_counts <- youtube_df %>%
  group_by(channel_type) %>%
  summarise(count = n()) %>%
  ungroup()
color <- c("#ffcc99","#ffe4b5", "#ffd180","#ffa07a","#d1d17a", "#8dab7f", "#D2DFD9", "#A8C0B5", "#D1B9CB", "#B3CDD1", "#BBC1D0", "#E8C3C3","#C7CEBD", "#D2DFD9","#6b8e23")
# Create a pie chart using plotly with the custom colors
fig <- plot_ly(channel_type_counts, labels = ~channel_type, values = ~count, type = 'pie',
textinfo = 'label+percent',
insidetextorientation = 'radial',
marker = list(colors = color))
fig %>%
layout(title = 'Pie Chart of Channel Types',
showlegend = FALSE,
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
This interactive pie chart shows the proportion of channels in
each channel type. The most common channel types, as
observed from the pie chart, include Entertainment,
Music, People & Blogs, Gaming,
and Comedy.
summary_data <- cleaned_df |>
  mutate(earning_diff = highest_yearly_earnings - lowest_yearly_earnings) |>
  filter(category != "nan") |>
  group_by(category) |>
  # na.rm = TRUE because NA values are retained in cleaned_df
  summarize(mean_earning_diff = mean(earning_diff, na.rm = TRUE),
            median_earning_diff = median(earning_diff, na.rm = TRUE))
# Create a grouped bar plot using Plotly
plot <- plot_ly(data = summary_data, x = ~category, type = 'bar',
                y = ~mean_earning_diff, name = 'Mean Earning Difference', marker = list(color = "#F4E1C1")) %>%
  add_trace(y = ~median_earning_diff, name = 'Median Earning Difference', marker = list(color = "#C7CEBD"))
# Add layout details
plot <- plot |>
  layout(title = 'Earning Difference Summary by Category',
         xaxis = list(title = 'Category'),
         yaxis = list(title = 'Earning Difference'),
         legend = list(orientation = 'h', x = 0.5, y = -0.3, xanchor = 'center', yanchor = 'top'))
# Show the plot
plot
For most categories, the mean earning difference is higher than the median, suggesting that a few channels with substantially higher earnings may be skewing the mean upwards.
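The gap between mean and median under right skew can be seen in a small worked example. The values below are invented: five channels with modest earning differences and one channel that is far larger.

```r
# Invented yearly earning differences for one category; one channel is far
# larger than the rest.
earning_diff <- c(1.2e5, 1.5e5, 1.8e5, 2.0e5, 2.4e5, 9.5e6)

mean_diff   <- mean(earning_diff)
median_diff <- median(earning_diff)
cat("mean:", mean_diff, " median:", median_diff, "\n")

# Removing the single large channel brings the mean close to the median.
trimmed <- earning_diff[earning_diff < 1e6]
cat("mean without the large channel:", mean(trimmed), "\n")
```

A single large value moves the mean well above the median, while the median stays with the typical channel.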
color <- c("#ffe4b5", "#ffa07a","#d1d17a" , "#D2DFD9", "#A8C0B5", "#D1B9CB", "#B3CDD1", "#BBC1D0", "#E8C3C3")
earning_plot_data <-
read_csv("Data/Global YouTube Statistics.csv",locale = locale(encoding = "Windows-1252")) %>%
janitor::clean_names() %>%
drop_na() %>%
select(youtuber, channel_type, highest_yearly_earnings) %>%
mutate(youtuber = stringi::stri_replace_all_regex(youtuber, "[^\x01-\x7F]", "")) %>%
arrange(desc(highest_yearly_earnings)) %>%
top_n(15, highest_yearly_earnings)
plot_ly(earning_plot_data, x = ~highest_yearly_earnings, y = ~youtuber,
type = 'bar', orientation = 'h',
color = ~channel_type, colors = color,
text = ~paste('$', formatC(highest_yearly_earnings, format = "d", big.mark = ",")),
textposition = 'inside',
insidetextanchor = 'end',
textfont = list(color = 'white'), # text color
hoverinfo = 'text',
hovertemplate = paste('<b>Youtuber:</b> %{y}<br>',
'<b>Earnings:</b> $%{x}<extra></extra>')) %>%
layout(title = 'Top 15 YouTube Channels by Highest Yearly Earnings',
xaxis = list(title = 'Yearly Earnings ($)'),
yaxis = list(title = ''),
showlegend = TRUE,
margin = list(l = 100, r = 25, t = 50, b = 50),
font = list(family = "Arial, sans-serif", size = 12, color = "#333333"))
Among the top 15 YouTube channels by highest yearly earnings, KIMPO ranks first with 163,400,400 US dollars in 2023, nearly three times the 59,800,000 dollars of dednahype, the lowest earner in the group. Entertainment remains the predominant category among these channels, while Animals and Comedy each appear only once.
cleaned_youtube_df <- cleaned_df %>%
filter(created_year >= 2005, !is.na(category), category != "nan")
world <- map_data("world")
world <- world %>%
mutate(region = ifelse(region == "USA", "United States", region),
region = ifelse(region == "UK", "United Kingdom", region))
popular_category <- cleaned_youtube_df %>%
group_by(country) %>%
summarise(most_popular_category = names(sort(table(category), decreasing = TRUE)[1])) %>%
ungroup()
world.plus1 <- world %>%
left_join(popular_category, by = c("region" = "country")) %>%
filter(region != "Antarctica")
world.plus1$most_popular_category[is.na(world.plus1$most_popular_category)] <- "Unknown"
categories <- unique(world.plus1$most_popular_category)
color <- c("#ffd180","#ffa07a","#d1d17a", "#8dab7f", "#D2DFD9", "#A8C0B5", "#D1B9CB", "#B3CDD1", "#BBC1D0", "#E8C3C3","#C7CEBD", "#D2DFD9","#6b8e23")
category_colors <- setNames(color, categories)
ggplot(data = world.plus1, mapping = aes(x = long, y = lat, group = group)) +
coord_fixed(1.3) +
geom_polygon(aes(fill = most_popular_category)) +
ggtitle("Most Popular YouTube Content Categories by Country") +
scale_fill_manual(values = category_colors) +
theme_minimal() +
theme(
axis.text = element_blank(),
axis.ticks = element_blank(),
plot.title = element_text(hjust = 0.5),
legend.title = element_blank(),
legend.position = "bottom"
)
The most popular YouTube content category varies by country, reflecting
differences in digital consumption. Music and People & Blogs are the most
popular categories across the world. Countries labeled “Unknown” have no
channel with a known category in the dataset, so that label marks a data
gap rather than a genre. The map underscores the importance of cultural
context in content creation and consumption on YouTube.
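The "most popular category per country" is the most frequent category among that country's channels. The sketch below reproduces the idea in base R on invented records; `top_category` is a helper named here for illustration.

```r
# Invented channel records for illustration.
channels <- data.frame(
  country  = c("India", "India", "India", "Brazil", "Brazil", "Japan"),
  category = c("Music", "Music", "Gaming", "Gaming", "Music", "Comedy"),
  stringsAsFactors = FALSE
)

# Most frequent category within one country. Ties go to whichever category
# sort() places first, so tied countries are not resolved in a meaningful way.
top_category <- function(x) names(sort(table(x), decreasing = TRUE))[1]

popular <- tapply(channels$category, channels$country, top_category)
print(popular)
```

In this toy data Brazil has a tie between two categories, and countries with a single channel, such as Japan here, are colored by that one channel. Both cases are worth keeping in mind when reading the map.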
size_factor <- 10^(-8.7) # Adjust this factor as needed to scale sizes up or down
youtube_df %>%
filter(!is.na(category) & category != "nan") %>%
plot_ly(
x = ~subscribers,
y = ~video_views,
size = ~video_views_for_the_last_30_days * size_factor,
color = ~category,
text = ~category,
hoverinfo = 'text+x+y',
type = 'scatter',
mode = 'markers',
marker = list(
sizemode = 'area',
sizeref = 2 * max(youtube_df$video_views_for_the_last_30_days * size_factor)/100
)
) %>%
layout(
title = 'Subscribers vs. Video Views by Category',
xaxis = list(type = 'log', title = 'Subscribers'),
yaxis = list(type = 'log', title = 'Video Views (in billions)'),
hovermode = 'closest',
showlegend = TRUE
)
The bubble chart shows the relationship between the number of subscribers (x-axis) and total video views (y-axis), both on logarithmic scales: channels with more subscribers tend to have more total video views. Bubble size represents the video views in the last 30 days, and color represents the category. Categories such as Music, Entertainment, and Gaming appear more often among the channels with the most views and subscribers, reflecting their popularity on YouTube.
perUploadData <- cleaned_df |>
mutate(viewsPerUpload = video_views / uploads) |>
filter(viewsPerUpload < 600000000)
scatter_plot_perUpLoad <- ggplot(data = perUploadData, aes(x = viewsPerUpload, y = highest_yearly_earnings)) +
geom_point(alpha=0.7,color="#8dab7f") +
geom_smooth(method = "lm", color = "#ffa07a") +
labs(title = 'Scatter Plot of Highest Yearly Earnings vs. Views per Upload for Content Creators',
x = 'Views per Upload',
y = 'Highest Yearly Earnings')+
theme_minimal()
# Print the plot
print(scatter_plot_perUpLoad)

The scatterplot shows a discernible but weak downward slope between highest yearly earnings and views per upload. In general, channels with more views per upload tend to have slightly lower earnings. The relationship is weak, however, which indicates that factors other than views per upload largely determine the earnings of YouTube channels.
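A visual impression of a weak slope can be backed up by the fitted slope and the R-squared of the same simple regression that `geom_smooth(method = "lm")` draws. The sketch below uses simulated data with a deliberately weak negative relationship; none of the numbers come from the YouTube dataset.

```r
set.seed(7)
n <- 300
# Simulated data with a weak negative relationship; all values are invented.
views_per_upload <- rlnorm(n, meanlog = 15, sdlog = 1)
earnings <- 5e6 - 0.2 * views_per_upload + rnorm(n, sd = 4e6)

fit       <- lm(earnings ~ views_per_upload)
slope     <- coef(fit)[["views_per_upload"]]
r_squared <- summary(fit)$r.squared

cat("slope:", signif(slope, 3), " R-squared:", round(r_squared, 3), "\n")
```

A negative slope together with a small R-squared corresponds to the reading given above: the direction is visible, but views per upload explain only a small share of the variation in earnings.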
Model statement: \(subscribers =\beta_0+ \beta_1country +\beta_2category +\beta_3videoperupload + \beta_4uploads + \beta_5(videoviews)\)
youtube_df <- youtube_df %>% drop_na()
# Fit the multiple linear regression model
mlr_model <- lm(subscribers ~ country + category + video_per_upload +uploads +video_views , data = youtube_df)
youtube_df %>%
modelr::add_predictions(mlr_model) %>%
ggplot(aes(x = earning_differences, y = pred)) +
geom_point() +
labs(
title = "Multivariate Linear Model",
x = "earning_differences",
y = "subscribers") +
theme_pubclean()

check_model(mlr_model, check = c("linearity", "outliers", "qq", "normality"))

# Summary of the model
mlr_model%>%
broom::tidy()%>%
knitr::kable(digits=3)
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 9376280.100 | 7999090.827 | 1.172 | 0.242 |
| countryAustralia | -5472212.306 | 8078440.308 | -0.677 | 0.498 |
| countryBarbados | 8869689.318 | 10470708.869 | 0.847 | 0.397 |
| countryBrazil | 1063831.738 | 3513446.361 | 0.303 | 0.762 |
| countryCanada | 739423.789 | 5202276.951 | 0.142 | 0.887 |
| countryChile | 9890517.481 | 6702783.611 | 1.476 | 0.141 |
| countryChina | 4306410.207 | 10860207.744 | 0.397 | 0.692 |
| countryColombia | 628666.001 | 4555343.991 | 0.138 | 0.890 |
| countryCuba | -15208894.389 | 39353393.646 | -0.386 | 0.699 |
| countryEcuador | 482942.941 | 7727581.887 | 0.062 | 0.950 |
| countryEgypt | 29112.446 | 7869216.335 | 0.004 | 0.997 |
| countryEl Salvador | 26539013.988 | 10547644.419 | 2.516 | 0.012 |
| countryFrance | 855.337 | 5886668.193 | 0.000 | 1.000 |
| countryGermany | -3845926.857 | 5872522.560 | -0.655 | 0.513 |
| countryIndia | 2230051.232 | 3157552.062 | 0.706 | 0.480 |
| countryIndonesia | 3667971.184 | 3822919.512 | 0.959 | 0.338 |
| countryItaly | -705999.140 | 7722280.705 | -0.091 | 0.927 |
| countryJapan | -4897660.367 | 5853688.097 | -0.837 | 0.403 |
| countryJordan | -5716738.222 | 6535116.343 | -0.875 | 0.382 |
| countryKuwait | 16638412.903 | 10548675.332 | 1.577 | 0.115 |
| countryLatvia | -10711499.813 | 10504680.653 | -1.020 | 0.308 |
| countryMalaysia | -1391640.694 | 10592297.201 | -0.131 | 0.896 |
| countryMexico | 2603524.238 | 3914550.883 | 0.665 | 0.506 |
| countryNetherlands | 1117730.024 | 7732549.460 | 0.145 | 0.885 |
| countryPakistan | -1808473.884 | 5213982.019 | -0.347 | 0.729 |
| countryPhilippines | -282333.291 | 4837637.524 | -0.058 | 0.953 |
| countryRussia | -46260.280 | 4147480.887 | -0.011 | 0.991 |
| countrySamoa | -5072846.712 | 10512137.650 | -0.483 | 0.630 |
| countrySaudi Arabia | 46119.008 | 5158210.195 | 0.009 | 0.993 |
| countrySingapore | -6983564.461 | 7751747.825 | -0.901 | 0.368 |
| countrySouth Korea | 11422156.311 | 4532264.675 | 2.520 | 0.012 |
| countrySpain | -701473.589 | 4377564.892 | -0.160 | 0.873 |
| countrySweden | 840319.852 | 7781547.211 | 0.108 | 0.914 |
| countrySwitzerland | -1172193.393 | 11106777.278 | -0.106 | 0.916 |
| countryThailand | -3951173.515 | 4195676.870 | -0.942 | 0.347 |
| countryTurkey | -12238300.878 | 6562702.799 | -1.865 | 0.063 |
| countryUkraine | -1572524.656 | 5485703.370 | -0.287 | 0.774 |
| countryUnited Arab Emirates | 1283973.834 | 5019181.377 | 0.256 | 0.798 |
| countryUnited Kingdom | 267042.876 | 3617711.558 | 0.074 | 0.941 |
| countryUnited States | 1231887.262 | 3131795.390 | 0.393 | 0.694 |
| countryVenezuela | 10814024.934 | 10522436.612 | 1.028 | 0.305 |
| countryVietnam | -3329160.717 | 7703918.031 | -0.432 | 0.666 |
| categoryComedy | 1089754.640 | 7534028.409 | 0.145 | 0.885 |
| categoryEducation | 831486.212 | 7619585.450 | 0.109 | 0.913 |
| categoryEntertainment | 1300231.209 | 7450481.786 | 0.175 | 0.862 |
| categoryFilm & Animation | 884298.396 | 7619659.355 | 0.116 | 0.908 |
| categoryGaming | 66210.665 | 7541954.178 | 0.009 | 0.993 |
| categoryHowto & Style | 974932.248 | 7962424.542 | 0.122 | 0.903 |
| categoryMovies | 6606348.817 | 10262823.888 | 0.644 | 0.520 |
| categoryMusic | 1582993.394 | 7462956.180 | 0.212 | 0.832 |
| categorynan | 520376.447 | 7553848.246 | 0.069 | 0.945 |
| categoryNews & Politics | 3920532.241 | 7970252.886 | 0.492 | 0.623 |
| categoryNonprofits & Activism | 14737864.453 | 10298596.694 | 1.431 | 0.153 |
| categoryPeople & Blogs | 1346729.750 | 7472123.599 | 0.180 | 0.857 |
| categoryPets & Animals | -2129591.804 | 9412383.766 | -0.226 | 0.821 |
| categoryScience & Technology | 4641578.970 | 7991011.762 | 0.581 | 0.562 |
| categoryShows | -2025300.247 | 8008875.468 | -0.253 | 0.800 |
| categorySports | 6076125.766 | 8283199.028 | 0.734 | 0.464 |
| categoryTrailers | 11292393.086 | 10264244.021 | 1.100 | 0.272 |
| video_per_upload | 0.001 | 0.002 | 0.782 | 0.434 |
| uploads | -27.510 | 13.321 | -2.065 | 0.039 |
| video_views | 0.001 | 0.000 | 36.598 | 0.000 |
mlr_model %>%
broom::glance() %>%
knitr::kable(digits=3)
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.754 | 0.725 | 9982695 | 26.429 | 0 | 61 | -10278 | 20682 | 20957.73 | 5.241811e+16 | 526 | 588 |
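The fit statistics in this table can be cross-checked against the coefficient table with a few lines of arithmetic. The sketch below copies the reported values, counts the model terms, and recomputes the adjusted R-squared and the F statistic; small differences from the table are expected because the inputs are rounded.

```r
# Values copied from the coefficient and fit tables above.
n_country_terms  <- 41   # country dummy rows in the coefficient table
n_category_terms <- 17   # category dummy rows
n_numeric_terms  <- 3    # video_per_upload, uploads, video_views
nobs      <- 588
r_squared <- 0.754

p           <- n_country_terms + n_category_terms + n_numeric_terms  # model df
df_residual <- nobs - p - 1                                          # minus the intercept

adj_r_squared <- 1 - (1 - r_squared) * (nobs - 1) / df_residual
f_statistic   <- (r_squared / p) / ((1 - r_squared) / df_residual)

cat("df:", p, " residual df:", df_residual, "\n")
cat("adjusted R-squared:", round(adj_r_squared, 3), " F:", round(f_statistic, 2), "\n")
```

The count of 61 terms shows where most of the model's degrees of freedom go: 58 of them are dummy variables for the country and category factors, each measured against a reference level.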
Model statement: \(earningdifferences =\beta_0+ \beta_1country +\beta_2category +\beta_3videoperupload + \beta_4uploads + \beta_5(videoviews)\)
mlr_model_1 <- lm(earning_differences ~ country + category + video_per_upload +uploads +video_views , data = youtube_df)
youtube_df %>%
modelr::add_predictions(mlr_model_1) %>%
ggplot(aes(x = earning_differences, y = pred)) +
geom_point() +
labs(
title = "Multivariate Linear Model",
x = "earning_differences",
y = "predictions") +
theme_pubclean()

check_model(mlr_model_1, check = c("linearity", "outliers", "qq", "normality"))

# Summary of the model
mlr_model_1%>%
broom::tidy() %>%
knitr::kable(digits=3)
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 7505843.615 | 9623358.716 | 0.780 | 0.436 |
| countryAustralia | 3286609.208 | 9718820.630 | 0.338 | 0.735 |
| countryBarbados | -7937853.449 | 12596855.022 | -0.630 | 0.529 |
| countryBrazil | -651672.832 | 4226874.703 | -0.154 | 0.878 |
| countryCanada | -2344415.371 | 6258633.427 | -0.375 | 0.708 |
| countryChile | -5990012.829 | 8063827.812 | -0.743 | 0.458 |
| countryChina | -3523664.107 | 13065444.200 | -0.270 | 0.788 |
| countryColombia | 1621029.024 | 5480336.485 | 0.296 | 0.768 |
| countryCuba | 32634726.214 | 47344358.496 | 0.689 | 0.491 |
| countryEcuador | -4753540.336 | 9296718.104 | -0.511 | 0.609 |
| countryEgypt | -4685984.291 | 9467112.356 | -0.495 | 0.621 |
| countryEl Salvador | -7529566.868 | 12689412.841 | -0.593 | 0.553 |
| countryFrance | -5545773.340 | 7081994.802 | -0.783 | 0.434 |
| countryGermany | 5023180.904 | 7064976.807 | 0.711 | 0.477 |
| countryIndia | -1103754.350 | 3798713.731 | -0.291 | 0.772 |
| countryIndonesia | -3702317.503 | 4599188.407 | -0.805 | 0.421 |
| countryItaly | 17725595.980 | 9290340.483 | 1.908 | 0.057 |
| countryJapan | 6095892.544 | 7042317.882 | 0.866 | 0.387 |
| countryJordan | -7667463.169 | 7862114.605 | -0.975 | 0.330 |
| countryKuwait | 973103.211 | 12690653.087 | 0.077 | 0.939 |
| countryLatvia | 37811642.749 | 12637724.999 | 2.992 | 0.003 |
| countryMalaysia | -4297556.430 | 12743132.664 | -0.337 | 0.736 |
| countryMexico | -3513157.824 | 4709426.130 | -0.746 | 0.456 |
| countryNetherlands | -5223709.855 | 9302694.376 | -0.562 | 0.575 |
| countryPakistan | 7363090.765 | 6272715.286 | 1.174 | 0.241 |
| countryPhilippines | -7033588.711 | 5819951.572 | -1.209 | 0.227 |
| countryRussia | -3850992.022 | 4989654.101 | -0.772 | 0.441 |
| countrySamoa | -4268565.526 | 12646696.189 | -0.338 | 0.736 |
| countrySaudi Arabia | -7073939.302 | 6205618.627 | -1.140 | 0.255 |
| countrySingapore | -13127214.056 | 9325791.095 | -1.408 | 0.160 |
| countrySouth Korea | 18564877.388 | 5452570.762 | 3.405 | 0.001 |
| countrySpain | -3992247.305 | 5266458.173 | -0.758 | 0.449 |
| countrySweden | 1087634.776 | 9361641.442 | 0.116 | 0.908 |
| countrySwitzerland | -5560344.074 | 13362081.296 | -0.416 | 0.677 |
| countryThailand | -9277039.168 | 5047636.594 | -1.838 | 0.067 |
| countryTurkey | 9189699.113 | 7895302.673 | 1.164 | 0.245 |
| countryUkraine | -5152068.187 | 6599611.441 | -0.781 | 0.435 |
| countryUnited Arab Emirates | 2677976.449 | 6038359.096 | 0.443 | 0.658 |
| countryUnited Kingdom | -4138003.116 | 4352311.632 | -0.951 | 0.342 |
| countryUnited States | -2187652.475 | 3767726.998 | -0.581 | 0.562 |
| countryVenezuela | -9302188.332 | 12659086.424 | -0.735 | 0.463 |
| countryVietnam | -1064136.589 | 9268249.147 | -0.115 | 0.909 |
| categoryComedy | 290642.811 | 9063862.322 | 0.032 | 0.974 |
| categoryEducation | -5143240.331 | 9166792.282 | -0.561 | 0.575 |
| categoryEntertainment | -2003594.778 | 8963351.009 | -0.224 | 0.823 |
| categoryFilm & Animation | -2426443.611 | 9166881.194 | -0.265 | 0.791 |
| categoryGaming | -3616205.562 | 9073397.471 | -0.399 | 0.690 |
| categoryHowto & Style | -4393500.858 | 9579247.101 | -0.459 | 0.647 |
| categoryMovies | -5226411.984 | 12346757.631 | -0.423 | 0.672 |
| categoryMusic | -5400621.796 | 8978358.410 | -0.602 | 0.548 |
| categorynan | 4461079.060 | 9087706.707 | 0.491 | 0.624 |
| categoryNews & Politics | -4810603.965 | 9588665.042 | -0.502 | 0.616 |
| categoryNonprofits & Activism | -7024002.689 | 12389794.340 | -0.567 | 0.571 |
| categoryPeople & Blogs | -2733019.699 | 8989387.334 | -0.304 | 0.761 |
| categoryPets & Animals | 1338221.067 | 11323630.062 | 0.118 | 0.906 |
| categoryScience & Technology | -5132121.348 | 9613639.143 | -0.534 | 0.594 |
| categoryShows | 805997.958 | 9635130.193 | 0.084 | 0.933 |
| categorySports | -2361036.909 | 9965156.952 | -0.237 | 0.813 |
| categoryTrailers | -10812255.538 | 12348466.131 | -0.876 | 0.382 |
| video_per_upload | -0.002 | 0.002 | -1.054 | 0.292 |
| uploads | 25.544 | 16.026 | 1.594 | 0.112 |
| video_views | 0.001 | 0.000 | 15.541 | 0.000 |
mlr_model_1 %>%
broom::glance() %>%
knitr::kable(digits=3)
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.438 | 0.373 | 12009746 | 6.729 | 0 | 61 | -10386.7 | 20899.4 | 21175.13 | 7.586709e+16 | 526 | 588 |
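Each row of the coefficient table follows the same recipe: the statistic is the estimate divided by its standard error, and the p-value comes from a two-sided t test on the residual degrees of freedom. The sketch below recomputes both for the `countrySouth Korea` row of this model, using the values reported above.

```r
# Estimate and standard error for countrySouth Korea, copied from the table above.
estimate    <- 18564877.388
std_error   <- 5452570.762
df_residual <- 526

t_statistic <- estimate / std_error
p_value     <- 2 * pt(-abs(t_statistic), df = df_residual)  # two-sided test

cat("t:", round(t_statistic, 3), " p:", round(p_value, 3), "\n")
```

With about 60 coefficients tested at once, a few p-values below 0.05 are expected by chance alone, so isolated significant country terms should be interpreted cautiously.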
Model statement: \(highestyearlyearnings =\beta_0+ \beta_1country +\beta_2category +\beta_3videoperupload + \beta_4uploads + \beta_5(videoviews)\)
mlr_model_2 <- lm(highest_yearly_earnings ~ country + category + video_per_upload+uploads +video_views , data = youtube_df)
youtube_df %>%
modelr::add_predictions(mlr_model_2) %>%
ggplot(aes(x = highest_yearly_earnings, y = pred)) +
geom_point() +
labs(
title = "Multivariate Linear Model",
x = "highest_yearly_earnings",
y = "predictions") +
theme_pubclean()

check_model(mlr_model_2, check = c("linearity", "outliers", "qq", "normality"))

# Summary of the model
mlr_model_2%>%
broom::tidy() %>%
knitr::kable(digits=3)
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 8014304.518 | 10263302.263 | 0.781 | 0.435 |
| countryAustralia | 3527574.695 | 10365112.296 | 0.340 | 0.734 |
| countryBarbados | -8457592.238 | 13434533.042 | -0.630 | 0.529 |
| countryBrazil | -684530.551 | 4507957.563 | -0.152 | 0.879 |
| countryCanada | -2491751.885 | 6674826.171 | -0.373 | 0.709 |
| countryChile | -6380571.699 | 8600064.143 | -0.742 | 0.458 |
| countryChina | -3752537.270 | 13934282.922 | -0.269 | 0.788 |
| countryColombia | 1742683.485 | 5844773.275 | 0.298 | 0.766 |
| countryCuba | 34821010.475 | 50492710.081 | 0.690 | 0.491 |
| countryEcuador | -5060952.920 | 9914940.382 | -0.510 | 0.610 |
| countryEgypt | -4991270.448 | 10096665.678 | -0.494 | 0.621 |
| countryEl Salvador | -8021060.834 | 13533245.861 | -0.593 | 0.554 |
| countryFrance | -5907274.618 | 7552940.238 | -0.782 | 0.434 |
| countryGermany | 5389320.770 | 7534790.563 | 0.715 | 0.475 |
| countryIndia | -1169257.163 | 4051324.322 | -0.289 | 0.773 |
| countryIndonesia | -3942928.257 | 4905029.749 | -0.804 | 0.422 |
| countryItaly | 18932374.896 | 9908138.656 | 1.911 | 0.057 |
| countryJapan | 6513549.121 | 7510624.844 | 0.867 | 0.386 |
| countryJordan | -8168177.752 | 8384937.213 | -0.974 | 0.330 |
| countryKuwait | 1045429.364 | 13534568.583 | 0.077 | 0.938 |
| countryLatvia | 40302216.073 | 13478120.831 | 2.990 | 0.003 |
| countryMalaysia | -4575492.584 | 13590537.998 | -0.337 | 0.737 |
| countryMexico | -3738452.955 | 5022598.168 | -0.744 | 0.457 |
| countryNetherlands | -5561365.107 | 9921314.070 | -0.561 | 0.575 |
| countryPakistan | 7858054.990 | 6689844.459 | 1.175 | 0.241 |
| countryPhilippines | -7494453.788 | 6206972.420 | -1.207 | 0.228 |
| countryRussia | -4099482.942 | 5321461.013 | -0.770 | 0.441 |
| countrySamoa | -4543730.517 | 13487688.596 | -0.337 | 0.736 |
| countrySaudi Arabia | -7538239.422 | 6618285.941 | -1.139 | 0.255 |
| countrySingapore | -13992604.701 | 9945946.697 | -1.407 | 0.160 |
| countrySouth Korea | 19806130.071 | 5815161.160 | 3.406 | 0.001 |
| countrySpain | -4250365.290 | 5616672.273 | -0.757 | 0.450 |
| countrySweden | 1148466.395 | 9984181.056 | 0.115 | 0.908 |
| countrySwitzerland | -5922009.800 | 14250646.083 | -0.416 | 0.678 |
| countryThailand | -9887305.924 | 5383299.283 | -1.837 | 0.067 |
| countryTurkey | 9825286.892 | 8420332.253 | 1.167 | 0.244 |
| countryUkraine | -5490115.784 | 7038478.876 | -0.780 | 0.436 |
| countryUnited Arab Emirates | 2867917.580 | 6439903.821 | 0.445 | 0.656 |
| countryUnited Kingdom | -4404027.123 | 4641735.920 | -0.949 | 0.343 |
| countryUnited States | -2325658.305 | 4018277.003 | -0.579 | 0.563 |
| countryVenezuela | -9915108.764 | 13500902.768 | -0.734 | 0.463 |
| countryVietnam | -1125864.331 | 9884578.269 | -0.114 | 0.909 |
| categoryComedy | 292233.597 | 9666599.929 | 0.030 | 0.976 |
| categoryEducation | -5503754.935 | 9776374.626 | -0.563 | 0.574 |
| categoryEntertainment | -2155817.991 | 9559404.715 | -0.226 | 0.822 |
| categoryFilm & Animation | -2608082.406 | 9776469.450 | -0.267 | 0.790 |
| categoryGaming | -3874183.218 | 9676769.155 | -0.400 | 0.689 |
| categoryHowto & Style | -4702253.317 | 10216257.270 | -0.460 | 0.646 |
| categoryMovies | -5590571.371 | 13167804.430 | -0.425 | 0.671 |
| categoryMusic | -5778060.854 | 9575410.093 | -0.603 | 0.546 |
| categorynan | 4742611.242 | 9692029.940 | 0.489 | 0.625 |
| categoryNews & Politics | -5147695.864 | 10226301.494 | -0.503 | 0.615 |
| categoryNonprofits & Activism | -7512888.934 | 13213703.036 | -0.569 | 0.570 |
| categoryPeople & Blogs | -2930450.073 | 9587172.430 | -0.306 | 0.760 |
| categoryPets & Animals | 1424822.996 | 12076639.919 | 0.118 | 0.906 |
| categoryScience & Technology | -5489495.414 | 10252936.348 | -0.535 | 0.593 |
| categoryShows | 845790.093 | 10275856.530 | 0.082 | 0.934 |
| categorySports | -2534034.619 | 10627829.732 | -0.238 | 0.812 |
| categoryTrailers | -11549784.310 | 13169626.543 | -0.877 | 0.381 |
| video_per_upload | -0.002 | 0.002 | -1.055 | 0.292 |
| uploads | 27.269 | 17.091 | 1.595 | 0.111 |
| video_views | 0.001 | 0.000 | 15.543 | 0.000 |
mlr_model_2 %>%
broom::glance() %>%
knitr::kable(digits=3)
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.438 | 0.373 | 12808382 | 6.731 | 0 | 61 | -10424.56 | 20975.11 | 21250.85 | 8.629275e+16 | 526 | 588 |
library(randomForest)
library(caTools)
set.seed(42) # for reproducibility
split <- sample.split(youtube_df$subscribers, SplitRatio = 0.8)
train_data <- subset(youtube_df, split == TRUE)
test_data <- subset(youtube_df, split == FALSE)
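# Train a Random Forest model
# Note: the model is fitted on the full youtube_df rather than train_data,
# so the rows in test_data are not unseen by the model
rf_modelS <- randomForest(subscribers ~ ., data=youtube_df, importance=TRUE)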
# Make predictions on the test set
predictions <- predict(rf_modelS, test_data)
# Calculate R^2 score
r2_score <- cor(test_data$subscribers, predictions)^2
# Output the fitted model summary
print(rf_modelS)
##
## Call:
## randomForest(formula = subscribers ~ ., data = youtube_df, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 6
##
## Mean of squared residuals: 5.343385e+13
## % Var explained: 85.25
We use the same predictors country,
category, video_per_upload,
uploads, and video_views to build three linear models
with responses subscribers,
earning_differences, and highest_yearly_earnings.
The respective \(r^2\) values are \(0.754, 0.438, 0.438\), so with these
covariates the model for subscribers fits better than the other two.
For comparison with multiple linear regression, we also trained a Random
Forest model to predict the number of subscribers, using all other
variables in the dataset as predictors. The data were split into training
and test sets, with 80% of the rows used for training. The \(r^2\) score of 0.977 on the test set
is well above the 0.754 obtained from the linear regression model. However, the forest
was fitted on the full dataset, so the test rows were seen during training
and this score is optimistic; the out-of-bag estimate reported by the model
is 85.25% of variance explained. On either measure, the Random Forest model
explains most of the variability in the subscriber count. Based on these
results, we conclude that the number of video views is one of the most
important factors for predicting the number of subscribers.
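The comparison of \(r^2\) across responses can be sketched in a few lines. This is a minimal illustration on synthetic data, not the analysis itself: `toy_df`, its two predictors, and the simulated coefficients are assumptions made only so the example runs on its own.

```r
set.seed(1)

# Hypothetical synthetic data standing in for youtube_df (not the real dataset)
n <- 200
toy_df <- data.frame(
  video_views = runif(n, 1e8, 1e10),
  uploads     = rpois(n, 500)
)
toy_df$subscribers             <- 1e6 + 2e-3 * toy_df$video_views + rnorm(n, sd = 2e6)
toy_df$highest_yearly_earnings <- 1e-3 * toy_df$video_views + rnorm(n, sd = 1e7)

predictors <- c("video_views", "uploads")
responses  <- c("subscribers", "highest_yearly_earnings")

# Fit one linear model per response with the same predictors and collect R^2
r_squared <- sapply(responses, function(resp) {
  fit <- lm(reformulate(predictors, response = resp), data = toy_df)
  summary(fit)$r.squared
})
print(round(r_squared, 3))
```

Keeping the predictor set fixed and varying only the response makes the \(r^2\) values directly comparable, which is the comparison made above.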
We also check the linear regression assumptions (e.g., homoscedasticity, normality of residuals) and observe potential outliers that can impact the model’s accuracy. To enhance model accuracy, we may consider removing these outliers and subsequently explore missing value imputation. In summary, the Random Forest model proves highly effective for this dataset; however, further validation, potentially with different datasets or cross-validation, is advisable to ensure the model’s generalizability and assess the potential impact of overfitting.
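One way to carry out the further validation mentioned above is to fit the forest on the training rows only and score it on rows it has never seen. The following is a rough sketch under stated assumptions: `toy_df` is synthetic data invented for illustration, and `rf_holdout`, `holdout_pred`, and `holdout_r2` are hypothetical names that do not appear in the analysis.

```r
library(randomForest)

set.seed(42)  # for reproducibility

# Hypothetical stand-in for youtube_df: synthetic data, not the real dataset
n <- 300
toy_df <- data.frame(
  video_views = runif(n, 1e8, 1e10),
  uploads     = rpois(n, 500),
  category    = factor(sample(c("Music", "Gaming", "Education"), n, replace = TRUE))
)
toy_df$subscribers <- 1e6 + 2e-3 * toy_df$video_views + rnorm(n, sd = 2e6)

# 80/20 split by row index
train_idx  <- sample(seq_len(n), size = floor(0.8 * n))
train_data <- toy_df[train_idx, ]
test_data  <- toy_df[-train_idx, ]

# Fit on the training rows only, so the test rows stay unseen
rf_holdout <- randomForest(subscribers ~ ., data = train_data, importance = TRUE)

# Held-out R^2 as the squared correlation between observed and predicted values
holdout_pred <- predict(rf_holdout, newdata = test_data)
holdout_r2   <- cor(test_data$subscribers, holdout_pred)^2
print(holdout_r2)
```

A held-out score computed this way is usually lower than one computed on rows the model was trained on, and is the fairer figure to set against the linear regression \(r^2\).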
Several coefficients are flagged as significant in the output (marked * or ***), indicating a strong association with the
number of subscribers. The model explains a substantial share of the variance in the number of subscribers (as indicated by the R-squared value). ‘video_views’ is a particularly strong predictor. However, most of the individual country and category predictors do not contribute significantly to the model. This suggests that while overall video views are important, the country and the content category may matter less, with the exception of South Korea.
The ‘video_views’ coefficient is highly significant (***), indicating a strong association with
‘earning_differences’. The model explains a moderate amount of the variance in ‘earning_differences’, and ‘video_views’ stands out as a strong predictor. The country and category variables generally do not significantly predict ‘earning_differences’, with a few exceptions. Notably, the ‘countryLatvia’ coefficient is significant (p = 0.007876), suggesting it has a distinct effect on ‘earning_differences’.
The third model is a Multiple Linear Regression (MLR) with ‘highest_yearly_earnings’ as the dependent variable and ‘country’, ‘category’, ‘video_per_upload’, ‘uploads’, and ‘video_views’ as independent variables.
The model has moderate explanatory power for ‘highest_yearly_earnings’, with ‘video_views’ being a particularly strong predictor. While most country and category variables are not significant on their own, the overall model is significant, suggesting that the combination of these variables helps predict the highest yearly earnings. The significant coefficients for ‘countryLatvia’ and ‘countrySouth Korea’ suggest that being in these countries is associated with a significant difference in ‘highest_yearly_earnings’ compared with the baseline country, which is the reference level of the factor and therefore has no row of its own in the output.
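To see which level acts as the baseline, one can inspect the factor levels directly. The sketch below uses a small made-up data frame, so `toy_df`, its country names, and its earnings values are purely illustrative and are not taken from the dataset.

```r
# Hypothetical toy data: the countries and earnings are made up for illustration
toy_df <- data.frame(
  country  = factor(c("Brazil", "India", "Latvia", "India", "Brazil", "Latvia")),
  earnings = c(10, 12, 30, 14, 11, 33)
)

# The first factor level is the reference; lm() absorbs it into the intercept
reference_country <- levels(toy_df$country)[1]
print(reference_country)

toy_fit <- lm(earnings ~ country, data = toy_df)
print(names(coef(toy_fit)))  # no coefficient is reported for the reference level

# relevel() switches the baseline, here to India
toy_df$country <- relevel(toy_df$country, ref = "India")
toy_fit_india  <- lm(earnings ~ country, data = toy_df)
print(names(coef(toy_fit_india)))
```

Each country coefficient is a difference from the reference level, so changing the baseline with `relevel()` changes both the estimates and which countries appear significant.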